knitr::opts_chunk$set(echo = TRUE)
library(tidyverse); packageVersion("tidyverse")
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.2     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## [1] '1.3.0'
library(yaml); packageVersion("yaml")
## [1] '2.2.1'
library(here); packageVersion("here")
## here() starts at /Users/sdm8/Desktop/changeseq-analysis
## [1] '0.1'
library(stringr); packageVersion("stringr")
## [1] '1.4.0'

1 Summary

Determine sources of variability for the CHANGE-Seq assay.

2 Approach

  1. Samples are run through the CHANGE-Seq wet lab procedure and the library prepped smples are split between NIST & St. Jude.
  2. Samples are pooled and sequenced.
  3. Resulting FASTQ files are handed off to me and run through the CHANGE-Seq program.
  4. Each samples produces an “identified_matched.txt” file which is what is used and described in this document.

Downstream Bioinformatic Analysis:

  1. Loading CHANGE-Seq pipeline output (namely - “identified_matched.txt” files)
  2. Annotating CHANGE-Seq data frame with sample pair/ replicate information
  3. Summary/ Statistical Analysis
    1. Summary Table
    2. StJude - NIST Comparison

3 Data Overview

3.1 Samples A-F Spring 2020

Letter ID Target Site
A CCR5_site_3
B CCR5_site_8
C TRAC_site_1
D CXCR4_site_7
E CTLA4_site_9
F AAVS1_site_14

Experimental Factors N: NIST S: St. Jude

Acronym Wet Lab Sequencing Lab Bioinformatics
NNN NIST NIST NIST
NSN NIST St. Jude NIST
NSS NIST St. Jude St. Jude

NOTE SSN: refers to the CHANGE-Seq manuscript/publication samples. See below for pre-run steps **Pending review by Cicera as samples do not seem to be matched depth-wise when compared.

NOTE NNN2: NNN same library prep sequenced on an expired MiSeq kit (details below) to test before running new prepped AEF replicate samples. When these were of adequate quality, we incorporated them into this analysis.

St Jude has not yet re-run our sequencing of the original A-F samples. (NNS)

  1. Loading CHANGE-Seq pipeline output (namely - “identified_matched.txt” files)
  2. Annotating CHANGE-Seq data frame with sample pair/ replicate information
  3. Summary/ Statistical Analysis
    1. Summary Table
    2. Interlaboratory Comparisons

3.2 Data Pre-Run Steps

Reference Genome Cicera: > You can download hg38 sequence from: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ and concatenate only the main chromosome sequences as hg38_chroms_only.fa

Downloaded all files with this naming scheme: “chrom1.fa.gz” + X & Y.

Concatenated files via:

cd ~/Desktop/reference_genome
cat *.fa > ~/Desktop/reference_genome/combined_hg38_chroms.fa

Trimmed NIST-sequenced files due to extra cycling with our kit via:

seqtk trimfq -e 150 \
A_S1_L001_R1_001.fastq.gz > A_S1_L001_R1_001_trimmed.fq

SSN files: Cicera: > Those samples were sequenced on a MiSeq V3 kit. For comparison of replicates, you will need to sample the gz files to the same sequence depth (between the replicates), and ideally, once you get the results, you would merge the tables and set a cutoff (read counts), because the low frequency sites are very likely to occur in one replicate but not in other. That’s what we use for the comparisons in figure 1. (edited)

you can use this command to check how many reads in each file:

parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz

Then, you can use this one to generate new files with the number of reads that you want:

seqtk sample -s100 read1.fq 1000 > sub1.fq

where 1000 is the number you want to sample to.

3.2.1 NN

Sample File Name File Size Depth
A, read 1 A_S1_L001_R1_001_trimmed.fq 956.9 MB 2631758
A, read 2 A_S1_L001_R2_001_trimmed.fq 874.8 MB 2631758
B, read 1 B_S2_L001_R1_001_trimmed.fq 647.9 MB 1780610
B, read 2 B_S2_L001_R2_001_trimmed.fq 623.7 MB 1780610
C, read 1 C_S3_L001_R1_001_trimmed.fq 743.2 MB 2049156
C, read 2 C_S3_L001_R2_001_trimmed.fq 706.6 MB 2049156
D, read 1 D_S4_L001_R1_001_trimmed.fq 935.6 MB 2586872
D, read 2 D_S4_L001_R2_001_trimmed.fq 739.4 MB 2586872
E, read 1 E_S5_L001_R1_001_trimmed.fq 741.1 MB 2056395
E, read 2 E_S5_L001_R2_001_trimmed.fq 701.1 MB 2056395
F, read 1 F_S6_L001_R1_001_trimmed.fq 845.6 MB 2339851
F, read 2 F_S6_L001_R2_001_trimmed.fq 817.3 MB 2339851
Neg Ctrl, read 1 neg-neg-control_S7_L001_R1_001_trimmed.fq 858.1 MB 2341975
Neg Ctrl, read 2 neg-neg-control_S7_L001_R2_001_trimmed.fq 803 MB 2341975

3.2.2 NS

Sample File Name File Size Depth
A, read 1 CRL1551_NIST_S1_R1_001.fastq 956.9 MB 2829840
A, read 2 CRL1551_NIST_S1_R2_001.fastq 874.8 MB 2829840
B, read 1 CRL1552_NIST_S2_R1_001.fastq 647.9 MB 2453359
B, read 2 CRL1552_NIST_S2_R2_001.fastq 623.7 MB 2453359
C, read 1 CRL1553_NIST_S3_R1_001.fastq 743.2 MB 2084289
C, read 2 CRL1553_NIST_S3_R2_001.fastq 706.6 MB 2084289
D, read 1 CRL1554_NIST_S4_R1_001.fastq 935.6 MB 2840809
D, read 2 CRL1554_NIST_S4_R2_001.fastq 739.4 MB 2840809
E, read 1 CRL1555_NIST_S5_R1_001.fastq 741.1 MB 2339453
E, read 2 CRL1555_NIST_S5_R2_001.fastq 701.1 MB 2339453
F, read 1 CRL1556_NIST_S6_R1_001.fastq 845.6 MB 2416305
F, read 2 CRL1556_NIST_S6_R2_001.fastq 817.3 MB 2416305
Neg Ctrl, read 1 neg-neg-CRL1557_NIST_S7_R1_001.fastq 858.1 MB 2604617
Neg Ctrl, read 2 neg-neg-CRL1557_NIST_S7_R2_001.fastq 803 MB 2604617

3.2.3 Notes Sample Sequencing

Sequencing Information for A-F Samples at NIST Original sequencing of A-F at NIST - Spring 2020 Re-sequencing of same library prep as spring2020 A-F samples at NIST. Run on expired MiSeq kit to test machine before running replicates of new prep A, E, F

To account for extra sequencing cycles, samples sequenced at NIST were trimmed via:

seqtk trimfq -e 150

Both locations pooled samples according to St. Jude’s qPCR quantification values.

Samples run with CHANGE-Seq v1.2.7

St Jude has not yet re-run our sequencing of the original A-F samples. (NNS)

NOTE About SSN Files

I sampled the files per Cicera’s instructions and checked to make sure the number of sequences matched those of St. Jude’s A-F sequencing of our samples. However, when I started to plot them with our other samples they drastically outweighed ours and made it very difficult to see our samples on the graphs. I will include two examples here and we can discuss how we want to proceed. We may just need to compare within sequencing depths and not try to down-sample.

Total read count of all samples showing how the SSN replicates even when down-sampling drowns out our samples

Total on-target read count with the down-sampled manuscript SSN replicates showing that it makes it very difficult to see our other samples when plotted together

4 Analysis

4.1 Loading and Munging data

Loading CHANGE-Seq output txt files into a single data frame

## Using message = FALSE to avoid printing read_tsv message about column specifications

## List of changeseq output files
changeseq_lst <- list.files(path = here("data-raw"), 
                           pattern = "identified_matched.txt",
                           include.dirs = TRUE, full.names = TRUE, 
                           recursive = TRUE) %>% 
    set_names(basename(.))

changeseq_df <- changeseq_lst %>%
    map_dfr(read_tsv, .id = "changeseq_run")
unique(changeseq_df$changeseq_run)
##  [1] "NNN2_A_CCR5_site_3_GM24385_identified_matched.txt"         
##  [2] "NNN2_B_CCR5_site_8_GM24385_identified_matched.txt"         
##  [3] "NNN2_C_TRAC_site_1_GM24385_identified_matched.txt"         
##  [4] "NNN2_D_CXCR4_site_7_GM24385_identified_matched.txt"        
##  [5] "NNN2_E_CTLA4_site_9_GM24385_identified_matched.txt"        
##  [6] "NNN2_F_AAVS1_site_14_GM24385_identified_matched.txt"       
##  [7] "NNN_A_CCR5_site_3_GM24385_identified_matched.txt"          
##  [8] "NNN_B_CCR5_site_8_GM24385_identified_matched.txt"          
##  [9] "NNN_C_TRAC_site_1_GM24385_identified_matched.txt"          
## [10] "NNN_D_CXCR4_site_7_GM24385_identified_matched.txt"         
## [11] "NNN_E_CTLA4_site_9_GM24385_identified_matched.txt"         
## [12] "NNN_F_AAVS1_site_14_GM24385_identified_matched.txt"        
## [13] "NSN_A_CRL1551_CCR5_site_3_GM24385_identified_matched.txt"  
## [14] "NSN_B_CRL1552_CCR5_site_8_GM24385_identified_matched.txt"  
## [15] "NSN_C_CRL1553_TRAC_site_1_GM24385_identified_matched.txt"  
## [16] "NSN_D_CRL1554_CXCR4_site_7_GM24385_identified_matched.txt" 
## [17] "NSN_E_CRL1555_CTLA4_site_9_GM24385_identified_matched.txt" 
## [18] "NSN_F_CRL1556_AAVS1_site_14_GM24385_identified_matched.txt"
## [19] "NSS_A_CRL1551_CCR5_site_3_GM24385_identified_matched.txt"  
## [20] "NSS_B_CRL1552_CCR5_site_8_GM24385_identified_matched.txt"  
## [21] "NSS_C_CRL1553_TRAC_site_1_GM24385_identified_matched.txt"  
## [22] "NSS_D_CRL1554_CXCR4_site_7_GM24385_identified_matched.txt" 
## [23] "NSS_E_CRL1555_CTLA4_site_9_GM24385_identified_matched.txt" 
## [24] "NSS_F_CRL1556_AAVS1_site_14_GM24385_identified_matched.txt"

NOTE - can modify data frame to add lab name, e.g. StJude or NIST, either using file paths or separate data frame. Annotating using file names

## This code is specific to the input file name format
changeseq_anno_df <- changeseq_df %>% 
    ## adding changeseq_run col, removing ending portion of string **NOTE THAT THIS WILL NEED CHANGED WHEN LOOKING AT OTHER DNA LINES**
    mutate(changeseq_run = str_remove(changeseq_run, "_GM24385_identified_matched.txt"),
           sample = "GM24385",
           ## target site name begins 5 characters into string of changeseq_run
          target_site = str_sub(changeseq_run, 5),
          ## further edit target site col if contains the St Jude file prefixes of CRLXXXX
           target_site = str_remove(target_site,  "CRL[:digit:]*_"),
            target_site = str_remove(target_site, "_sampled_"),
          ## lab col will contain the first 3 letter acronyms
          lab = substr(changeseq_run, 1,4),
             lab = str_remove(lab, "_")) %>%
     ## possible identifiers: NNN, NSN, NSS
     ## first = wet lab, second = sequence, third = bioinformatics
    ## take the 3 letters in lab col and split into 3 additional cols 
    separate(lab, into = c("wet_lab","seq_lab","bioinf"), 
           sep = c(1,2,3),
           remove = FALSE) %>%
  ## change the col contents from just letters to what they indicate
  mutate(wet_lab = if_else(wet_lab == "N", "NIST", "StJude"),
         seq_lab = if_else(seq_lab == "N", "NIST", "StJude"),
         bioinf = if_else(bioinf == "N", "NIST", "StJude"))

changeseq_anno_df ## Samantha may want this one written out

NOTE - The lab, target_site, and/ or changeseq_run variables can be used as grouping variables to calculate within dataset summarise.

4.2 Calculating Summary Metrics

Total Read Counts per Sample

total_count_df <- changeseq_anno_df %>% 
    group_by(changeseq_run, lab, target_site) %>% 
    summarise(total_read_count = sum(Nuclease_Read_Count))
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)

Total on-target read count

target_count_df <- changeseq_anno_df %>% 
    filter(Site_Substitution_Number == 0) %>% 
    rename(on_target_count = Nuclease_Read_Count) %>% 
    select(changeseq_run, lab, target_site, on_target_count)

Percent on-target read count (on-target read count/total read count)

on_off_target_df <- left_join(total_count_df, target_count_df) %>% 
    mutate(on_target_rate = on_target_count/total_read_count)
## Joining, by = c("changeseq_run", "lab", "target_site")

Total off-target read count Percent off-target read count (off-target read count/total read count)

summary_df <- changeseq_anno_df %>% 
    ## replace NAs in Site_Substitution_Number col
    mutate(Site_Substitution_Number = if_else(is.na(Site_Substitution_Number), -1, Site_Substitution_Number), 
           ## assign read count to on-target col of on-target row per sample
            on_target = if_else(Site_Substitution_Number == 0, Nuclease_Read_Count, 0),
           ## assign off-target read counts in non-on-target rows
            off_target = if_else(on_target == 0, Nuclease_Read_Count, 0),
            ## seqs with subs only = seq in Site_Sequence col only
            Sub_Only = if_else(is.na(Site_Sequence), FALSE, TRUE),
            ## seqs with indels only = seq in Site_Sequence_Gaps_Allowed col only
            Indel_Only = if_else(is.na(Site_Sequence_Gaps_Allowed), TRUE, FALSE),
            ## seqs with two possible reps = seq in both cols
            Sub_or_Indel = if_else(is.na(Sub_Only || Indel_Only), FALSE, TRUE))

Number of sequences with substitutions only Number sequences with indels only Number of sequences with two possible representations (substitutions or indels)

## create df to plot off-target ratio
off_target_ratio <- summary_df %>%  
    group_by(changeseq_run, lab, target_site) %>% 
    summarise(total_count = sum(Nuclease_Read_Count),
              total_on_target = sum(on_target),
              total_off_target = sum(off_target),
              off_target_ratio = total_off_target/total_count)
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)
## create df to plot alignment of off-target seq resolution details
off_target_seq_align_df <- summary_df%>%    
    group_by(changeseq_run, lab, target_site) %>%
    summarise(total_sub_only_seqs = sum(Sub_Only),
              total_indels_only = sum(Indel_Only),
              total_sub_or_indel = sum(Sub_or_Indel))
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)

NOTE CONSIDERING SEQUENCES THAT HAVE SUBSTITUTIONS ONLY… On-target to top 25% reads (on-target read count/top 25% off-target read count)

## create df of all off-target rows sorted by decreasing read count value
off_target_df <- summary_df %>%
    mutate(Substitution_Only = if_else(is.na(Site_Sequence), FALSE, TRUE),
           on_target = if_else(Site_Substitution_Number == 0, Nuclease_Read_Count, 0)) %>% 
    filter(Substitution_Only == TRUE) %>%
        arrange(-Nuclease_Read_Count)

## take the top 25% of the off-target df - based on read count value
top_quarter_off_target_df <- off_target_df %>%
    ## Use top_frac to get the top 25% - use group_by to capture within run or site analysis
      top_frac(0.25, wt = Nuclease_Read_Count) %>%
  group_by(changeseq_run, lab, target_site) %>% 
  summarise(top_quarter_offtarget_readcount = sum(Nuclease_Read_Count))
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)

On-target to most prevalent off-target read (on-target read count/off-target read with largest read count)

## create df to plot ratio between on-target read count and most prevalent off-target read count per sample
on_top_off_df <- summary_df %>% 
    mutate(Site_Substitution_Number = if_else(is.na(Site_Substitution_Number), -1, Site_Substitution_Number), 
            on_target = if_else(Site_Substitution_Number == 0, Nuclease_Read_Count, 0),
           off_target = if_else(on_target == 0, Nuclease_Read_Count, 0)) %>% 
    group_by(changeseq_run, lab, target_site) %>% 
    summarise(on_target_count = max(on_target),
              off_target_max = max(off_target)) %>% 
    mutate(on_off_max_ratio = on_target_count/ off_target_max)
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)

Read count for seqs w/ 1 substitution from target sequence Read count for seqs w/ 2 substitutions from target sequence Read count for seqs w/ 3 substitutions from target sequence Read count for seqs w/ 4 substitutions from target sequence Read count for seqs w/ 5 substitutions from target sequence Read count for seqs w/ 6 substitutions from target sequence

## create df to plot the binned read counts of off-target substitution numbers
substitution_readcount_df <- summary_df %>%
    mutate(Substitution_Only = if_else(is.na(Site_Sequence), FALSE, TRUE)) %>% 
    filter(Substitution_Only == TRUE,
           Site_Substitution_Number > 0) %>% 
    group_by(changeseq_run, lab, target_site, Site_Substitution_Number) %>% 
    summarise(substitution_read_count = sum(Nuclease_Read_Count))
## `summarise()` regrouping output by 'changeseq_run', 'lab', 'target_site' (override with `.groups` argument)

4.3 Table

## Making substitution count data frame wide for join
sub_counts_wide_df <- substitution_readcount_df %>% 
  mutate(Site_Substitution_Number = paste("sub",Site_Substitution_Number, "count", 
                                          sep = "_")) %>%
  pivot_wider(names_from = Site_Substitution_Number, 
              values_from = substitution_read_count, values_fill = 0)

## Combining summary metric data frames
summary_table <- left_join(off_target_ratio, off_target_seq_align_df) %>% 
  left_join(top_quarter_off_target_df) %>% 
  left_join(on_top_off_df) %>% 
  left_join(sub_counts_wide_df)
## Joining, by = c("changeseq_run", "lab", "target_site")
## Joining, by = c("changeseq_run", "lab", "target_site")
## Joining, by = c("changeseq_run", "lab", "target_site")
## Joining, by = c("changeseq_run", "lab", "target_site")
## write out summary table
## Note - do not need to use paste with here, here converts the vector of character strings to a file path
write_tsv(summary_table, path = here("data", "summary_stats_table.tsv"), 
          na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
## Potentially use the DT or other packages for visually pleasing table presentation
summary_table %>% DT::datatable()

Summary Stats Table Column Headers

[1] “changeseq_run”: lab identifiers, sample letter ID, target site
[2] “lab”: 3-letter identifiers for wet lab, sequencing lab, and bioinformatics lab; NIST, St. Jude
[3] “target_site”: target site
[4] “total_count”: total read count for that sample (both on and off-target)
[5] “total_on_target”: total read count for the on-target sequence for that sample
[6] “total_off_target”: total read count for the off-target sequences for that sample
[7] “off_target_ratio”: total on-target read count/total off-target read count
[8] “total_sub_only_seqs”: total read count for the off-target sequences that are resolved in alignment with nucleotide substitutions only
[9] “total_indels_only”: total read count for the off-target sequences that are resolved in alignment with gaps (indels) only
[10] “total_sub_or_indel”: total read count for the off-target sequences that are resolved in alignment with substitutions or gaps
[11] “top_quarter_offtarget_readcount”: the total read count for the top quarter of off-target reads (sorted by decreasing read count) [12] “off_target_max”: the read count for the most prevalent off-target sequence (off-target sequence with largest read count)
[13] “on_off_max_ratio”: on-target read count/read count for most prevalent off-target sequence
[14] sub_X_count: the unique substitution number found across the samples with the total number of occurrences of that read number in the off-target coordinates across each sample

4.4 Plots

4.4.1 Total Read Counts & On-Target Read Counts per Sample

For each target site, each lab’s total read count is plotted. This gives an idea how much data is generated as a whole for each target site. For each target site, each lab’s total on-target read count is plotted.

ggplot(total_count_df) + 
    geom_bar(aes(x = target_site, y = total_read_count, fill = lab), 
             position = "dodge", stat = "identity") +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Editing Target Site", y = "Total Read Count", 
         fill = "Lab") 

ggplot(target_count_df) + 
    geom_bar(aes(x = target_site, y = on_target_count, fill = lab), 
             position = "dodge", stat = "identity") +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Editing Target Site", y = "Total On-Target Read Count", 
         fill = "Lab")

Sample F for NSS has a drastically lower read count. This was found to be a copy/paste error in their manifest file. They were pointing to sample E’s read fastq files.

4.4.2 On-Target Rate

For each target site, each lab’s on target rate (total on-target read count / total read count) is plotted.

ggplot(on_off_target_df) + 
    geom_point(aes(x = target_site, y = on_target_rate, fill = lab), 
               shape = 21, alpha = 0.5) + 
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90)) +
        labs(x = "Editing Target Site", y = "On Target Rate", 
         fill = "Lab")

On-target rates for sample A, E, and F are nearly matched and variance for samples B, C, and D is observed. NIST-sequenced samples are slightly higher.

4.4.3 Off-Target Rate

For each target site, each lab’s off-target rate (off-target read count / total read count) is plotted.

ggplot(off_target_ratio) + 
    geom_point(aes(x = target_site, y = off_target_ratio, fill = lab), 
               shape = 21, alpha = 0.5) + 
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90)) +
        labs(x = "Editing Target Site", y = "Off Target Rate", 
         fill = "Lab")

The rates of the off-target sequences are generally all higher than that of the on-target rates. Like the on-target rates, samples A, E, and F are similar in value. Samples B, C, and D have variance with the inverse of the above - the off-target rates are higher for the St. Jude-sequenced samples than that of the NIST-sequenced samples.

4.4.4 Off-target Sequence Alignment: Sub/Gap

For each target site, the number of off-target sequences that are resolved by substitutions only, indels (gaps) only, or substitutions or gaps are plotted for each lab.

off_target_seq_align_df %>% 
    ## First make the table long
    gather(key = "metric", value = "value", -changeseq_run, -lab, -target_site)
off_target_seq_align_df %>% 
    ## First make the table long
    gather(key = "metric", value = "value", -changeseq_run, -lab, -target_site) %>% 
    ggplot() +
    geom_bar(aes(x = metric, 
                 y = value,
                 fill = lab),
             position = "dodge", 
             stat = "identity"
             ) +
    ## Split plot by summary metric
    facet_wrap(~target_site, scales = "free_y") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
  labs(x = "Mismatch Type", y = "Number of Sequences", fill = "Lab")

The off-target sequences are aligned to the target sequence. These are binned in the following manner: in order to align properly to the target sequence, the off-target sequence could be resolve with substitutions (and assigned a substitution number), gaps (insertions or deletions) or by both possibilities. Here, I have binned them accordingly. As before, no differences are observed between the samples sequenced at St. Jude but run through the CHANGE-Seq bioinformatics pipeline by me and St. Jude, however, variance is observed between the samples sequenced at NIST vs. St. Jude. For samples A and B, NIST-sequenced off-target reads are overall in lower numbers as compared to St. Jude.

4.4.5 Top 25% off-target read counts

For each target site, weighted by read count, the top 25% off-target sequence read counts for each lab are plotted.

ggplot(top_quarter_off_target_df) + 
    geom_bar(aes(x = target_site, y = top_quarter_offtarget_readcount, fill = lab), 
             position = "dodge", stat = "identity") +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Total Read Count of Top 25% Off-Targets", y = "Editing Target Site", 
         fill = "Lab")

4.4.6 Ratio of on-target read counts to the off-target sequence with the highest read count

The on-target read count divided by the most prevalent off-target sequence (the off-target sequence with the largest read count) for each sample is plotted for each lab.

ggplot(on_top_off_df) + 
    geom_point(aes(x = target_site, y = on_off_max_ratio, fill = lab), 
               shape = 21, alpha = 0.5) + 
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90)) +
        labs(x = "Editing Target Site", y = "On-Target Rate to Most Prevalent Off-Target", 
         fill = "Lab")

As seen with the on-target rate considering total read counts, NIST has higher rates than St. Jude-sequenced samples.

4.4.7 Read counts per subsitution number for each sample

For each substitution number (or for indels only or subs/indels which are grouped separately), the number of off-target sequences that are classified as such as such are plotted for each target sample and lab.

NOTE I currently have these plotting read counts but we may want to either replace this with or add another graph that shows occurrences for each sub number.

indel_df <- off_target_seq_align_df %>% 
  select(-total_sub_only_seqs) %>% 
  pivot_longer(names_to = "diff_cat", values_to = "read_count", 
               cols = c("total_indels_only","total_sub_or_indel"))

diff_cat_read_count_df <- substitution_readcount_df %>% 
  mutate(diff_cat = str_c("Substitution #", Site_Substitution_Number, sep = "")) %>% 
  select(-Site_Substitution_Number) %>% 
  rename(read_count = substitution_read_count) %>% 
  bind_rows(indel_df)

ggplot(diff_cat_read_count_df) + 
    geom_bar(aes(x = diff_cat, 
                 y = read_count, fill = lab), 
             position = "dodge", stat = "identity") +
  facet_wrap(~target_site) + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Difference Type", y = "Read Count", 
         fill = "Lab")

4.5 Venn Bar Plots

As opposed to the typical venn diagrames, these show relative abundance and give appropriate scaling to visualize values.

NOTE NATE - can we add the values on the graph??

off_target_scatter_df <- off_target_df %>% 
  select(Chromosome, Start, End, Nuclease_Read_Count, target_site, lab)

off_target_lab_set <- off_target_df %>%
  select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>% 
  group_by(`Genomic Coordinate`, target_site) %>% 
  summarise(lab_set = str_c(sort(lab), collapse = "-"))
## `summarise()` regrouping output by 'Genomic Coordinate' (override with `.groups` argument)
venn_df <- off_target_df %>%
  select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>%
  pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = NA) %>%
  left_join(off_target_lab_set)
## Joining, by = c("Genomic Coordinate", "target_site")
ggplot(data = venn_df) + 
  geom_bar(aes(x = target_site, fill = lab_set), position = "fill") + 
  coord_flip() + 
  theme_bw() + 
  labs(x = "Target Site", 
       y = "Proportion of Coordinates", 
       fill = "Lab Combinations")

4.6 Read Count Replicate v Replicate Scatter Plots

4.6.1 Overlapping Off-Target Coordinates

To mimic the plots of Figure1, i in the CHANGE-Seq manuscript, common genomic coordinates of off-targets for each sample replicates are plotted in log scale.

## NSN vs. NSS: Concordant Off-Target Coordinates
nsn_v_nss <- off_target_scatter_df %>% 
  pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = 0) %>% 
   filter(NSS != 0, NSN != 0) %>%
  ggplot(aes(x = NSN, y = NSS)) + 
   geom_point(alpha = 0.25) + 
   ggpubr::stat_cor(aes(label = ..rr.label..), method = "pearson") + 
  scale_x_log10() + 
  scale_y_log10() + 
  facet_wrap(~target_site, ncol = 1) + 
  theme_bw()

## NSN vs. NNN: Concordant Off-Target Coordinates
nsn_v_nnn <- off_target_scatter_df %>% 
  pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = 0) %>% 
   filter(NSN != 0, NNN != 0) %>%
  ggplot(aes(x = NSN, y = NNN)) + 
   geom_point(alpha = 0.25) + 
   ggpubr::stat_cor(aes(label = ..rr.label..), method = "pearson") + 
  scale_x_log10() + 
  scale_y_log10() + 
  facet_wrap(~target_site, ncol = 1) + 
  theme_bw()

## NSS vs. NNN: Concordant Off-Target Coordinates
nss_v_nnn <- off_target_scatter_df %>% 
  pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = 0) %>% 
   filter(NSS != 0, NNN != 0) %>%
  ggplot(aes(x = NSS, y = NNN)) + 
   geom_point(alpha = 0.25) + 
   ggpubr::stat_cor(aes(label = ..rr.label..), method = "pearson") + 
  scale_x_log10() + 
  scale_y_log10() + 
  facet_wrap(~target_site, ncol = 1) + 
  theme_bw()
ggpubr::ggarrange(nsn_v_nss, nsn_v_nnn, nss_v_nnn, ncol = 3)

4.7 Unique Read Count Occurrence Plots

The read count values for the off-target coordinates unique to replicate 1, replicate 2, and the intersection of the replicates was extracted. Unique read count values from all 3 data were then used to tally each occurrence across the unique genomic coordinates for replicate 1, replicate 2. and the intersection of the replicates

Our traditional way of viewing

nnn_nsn_lab_set <- off_target_df %>%
  select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>% 
  group_by(`Genomic Coordinate`, target_site) %>% 
  filter(lab != "NSS") %>% 
  summarise(lab_set = str_c(sort(lab), collapse = "-"))
## `summarise()` regrouping output by 'Genomic Coordinate' (override with `.groups` argument)
rep_count_scatter_df <- off_target_df %>%
  select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>% 
  filter(lab != "NSS") %>% 
  right_join(nnn_nsn_lab_set) %>% 
  count(target_site, lab_set, Nuclease_Read_Count, name = "Occurence")
## Joining, by = c("Genomic Coordinate", "target_site")
ggplot(rep_count_scatter_df) + 
  geom_point(aes(x = Nuclease_Read_Count, y = Occurence, fill = lab_set), shape = 21) + 
  scale_x_log10() + 
  facet_wrap(~target_site, scales = "free_y", ncol = 1) +
  theme_bw()

Density view

nnn_nsn_lab_set <- off_target_df %>%
  select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>% 
  group_by(`Genomic Coordinate`, target_site) %>% 
  filter(lab != "NSS") %>% 
  summarise(lab_set = str_c(sort(lab), collapse = "-"))
## `summarise()` regrouping output by 'Genomic Coordinate' (override with `.groups` argument)
rep_count_scatter_df <- off_target_df %>%
  select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>% 
  filter(lab != "NSS") %>% 
  right_join(nnn_nsn_lab_set)
## Joining, by = c("Genomic Coordinate", "target_site")
ggplot(rep_count_scatter_df) + 
  geom_histogram(aes(x = Nuclease_Read_Count, fill = lab_set)) + 
  scale_x_log10() + 
  facet_wrap(~target_site, scales = "free_y", ncol = 1) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is plotting the proportion of sites (occurrence / total read count) as opposed to just occurrence number as an attempt to normalize

rep_count_scatter_df %>% 
  mutate(lab_set = factor(lab_set, level = c("NNN-NSN", "NNN","NSN"))) %>% 
ggplot() + 
  geom_histogram(aes(x = Nuclease_Read_Count, fill = lab_set), position = "fill") + 
  scale_x_log10() + 
  facet_wrap(~target_site, scales = "free_y", ncol = 1) +
  theme_bw() +
  scale_fill_brewer(type = "qual", palette = 2) +
  labs(x = "Read Count", y = "Proportion of Sites")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 331 rows containing missing values (geom_bar).

5 Session Information

5.1 System Information

sessioninfo::platform_info()
##  setting  value                       
##  version  R version 4.0.1 (2020-06-06)
##  os       macOS Catalina 10.15.5      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2020-08-21

5.2 Package Versions

sessioninfo::package_info() %>% 
    filter(attached = TRUE) %>% 
    select(package, loadedversion, date, source) %>%
    knitr::kable(booktabs = TRUE, row.names = FALSE)
package loadedversion date source
abind 1.4-5 2016-07-21 CRAN (R 4.0.2)
assertthat 0.2.1 2019-03-21 CRAN (R 4.0.0)
backports 1.1.8 2020-06-17 CRAN (R 4.0.0)
blob 1.2.1 2020-01-20 CRAN (R 4.0.0)
bookdown 0.20 2020-06-23 CRAN (R 4.0.2)
broom 0.5.6 2020-04-20 CRAN (R 4.0.0)
car 3.0-9 2020-08-11 CRAN (R 4.0.1)
carData 3.0-4 2020-05-22 CRAN (R 4.0.2)
cellranger 1.1.0 2016-07-27 CRAN (R 4.0.0)
cli 2.0.2 2020-02-28 CRAN (R 4.0.0)
colorspace 1.4-1 2019-03-18 CRAN (R 4.0.0)
cowplot 1.0.0 2019-07-11 CRAN (R 4.0.2)
crayon 1.3.4 2017-09-16 CRAN (R 4.0.0)
crosstalk 1.1.0.1 2020-03-13 CRAN (R 4.0.2)
curl 4.3 2019-12-02 CRAN (R 4.0.0)
data.table 1.12.8 2019-12-09 CRAN (R 4.0.0)
DBI 1.1.0 2019-12-15 CRAN (R 4.0.0)
dbplyr 1.4.4 2020-05-27 CRAN (R 4.0.0)
digest 0.6.25 2020-02-23 CRAN (R 4.0.0)
dplyr 1.0.0 2020-05-29 CRAN (R 4.0.0)
DT 0.14 2020-06-24 CRAN (R 4.0.2)
ellipsis 0.3.1 2020-05-15 CRAN (R 4.0.0)
evaluate 0.14 2019-05-28 CRAN (R 4.0.0)
fansi 0.4.1 2020-01-08 CRAN (R 4.0.0)
farver 2.0.3 2020-01-16 CRAN (R 4.0.0)
forcats 0.5.0 2020-03-01 CRAN (R 4.0.0)
foreign 0.8-80 2020-05-24 CRAN (R 4.0.1)
fs 1.4.2 2020-06-30 CRAN (R 4.0.1)
generics 0.0.2 2018-11-29 CRAN (R 4.0.0)
ggplot2 3.3.2 2020-06-19 CRAN (R 4.0.0)
ggpubr 0.4.0 2020-06-27 CRAN (R 4.0.2)
ggsignif 0.6.0 2019-08-08 CRAN (R 4.0.2)
glue 1.4.1 2020-05-13 CRAN (R 4.0.0)
gtable 0.3.0 2019-03-25 CRAN (R 4.0.0)
haven 2.3.1 2020-06-01 CRAN (R 4.0.0)
here 0.1 2017-05-28 CRAN (R 4.0.0)
hms 0.5.3 2020-01-08 CRAN (R 4.0.0)
htmltools 0.5.0 2020-06-16 CRAN (R 4.0.0)
htmlwidgets 1.5.1 2019-10-08 CRAN (R 4.0.0)
httr 1.4.1 2019-08-05 CRAN (R 4.0.0)
jsonlite 1.7.0 2020-06-25 CRAN (R 4.0.0)
knitr 1.29 2020-06-23 CRAN (R 4.0.1)
labeling 0.3 2014-08-23 CRAN (R 4.0.0)
lattice 0.20-41 2020-04-02 CRAN (R 4.0.1)
lifecycle 0.2.0 2020-03-06 CRAN (R 4.0.0)
lubridate 1.7.9 2020-06-08 CRAN (R 4.0.0)
magrittr 1.5 2014-11-22 CRAN (R 4.0.0)
modelr 0.1.8 2020-05-19 CRAN (R 4.0.0)
munsell 0.5.0 2018-06-12 CRAN (R 4.0.0)
nlme 3.1-148 2020-05-24 CRAN (R 4.0.1)
openxlsx 4.1.5 2020-05-06 CRAN (R 4.0.2)
pillar 1.4.4 2020-05-05 CRAN (R 4.0.0)
pkgconfig 2.0.3 2019-09-22 CRAN (R 4.0.0)
purrr 0.3.4 2020-04-17 CRAN (R 4.0.0)
R6 2.4.1 2019-11-12 CRAN (R 4.0.0)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 4.0.0)
Rcpp 1.0.5 2020-07-06 CRAN (R 4.0.1)
readr 1.3.1 2018-12-21 CRAN (R 4.0.0)
readxl 1.3.1 2019-03-13 CRAN (R 4.0.0)
reprex 0.3.0 2019-05-16 CRAN (R 4.0.0)
rio 0.5.16 2018-11-26 CRAN (R 4.0.2)
rlang 0.4.6 2020-05-02 CRAN (R 4.0.0)
rmarkdown 2.3 2020-06-18 CRAN (R 4.0.0)
rprojroot 1.3-2 2018-01-03 CRAN (R 4.0.0)
rstatix 0.6.0 2020-06-18 CRAN (R 4.0.2)
rstudioapi 0.11 2020-02-07 CRAN (R 4.0.0)
rvest 0.3.5 2019-11-08 CRAN (R 4.0.0)
scales 1.1.1 2020-05-11 CRAN (R 4.0.0)
sessioninfo 1.1.1 2018-11-05 CRAN (R 4.0.2)
stringi 1.4.6 2020-02-17 CRAN (R 4.0.0)
stringr 1.4.0 2019-02-10 CRAN (R 4.0.0)
tibble 3.0.2 2020-07-07 CRAN (R 4.0.1)
tidyr 1.1.0 2020-05-20 CRAN (R 4.0.0)
tidyselect 1.1.0 2020-05-11 CRAN (R 4.0.0)
tidyverse 1.3.0 2019-11-21 CRAN (R 4.0.0)
vctrs 0.3.1 2020-06-05 CRAN (R 4.0.0)
withr 2.2.0 2020-04-20 CRAN (R 4.0.0)
xfun 0.15 2020-06-21 CRAN (R 4.0.1)
xml2 1.3.2 2020-04-23 CRAN (R 4.0.0)
yaml 2.2.1 2020-02-01 CRAN (R 4.0.0)
zip 2.1.0 2020-08-10 CRAN (R 4.0.1)